Project Author: Alexander Lacson
This is the continuation of a project in which I analyze data from OKCupid. The first part is here. I felt that the section on predictive models in the previous work needed improvement; this notebook is a revision and expansion of that section.
The diagram below comes from the scikit-learn docs. This is the procedure we will follow: the left section outlines the model parameter tuning process, and the right section outlines the model test setup and evaluation process.

# Configure whether each model's parameter tuning search will be executed:
execute_dt_search = True
execute_rf_search = True
execute_lr_search = True
import pandas as pd
from scipy.sparse import coo_matrix
df = pd.read_csv('ml_ready_data.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59811 entries, 0 to 59810
Columns: 738 entries, Unnamed: 0 to speaks_yiddish (poorly)
dtypes: int64(728), object(10)
memory usage: 336.8+ MB
def feature_selection_to_list(df, cat_selection, numeric_selection):
    # Expand each categorical prefix into its matching one-hot column names
    categorical_feats = []
    for each in cat_selection:
        categorical_feats = categorical_feats + df.loc[:, df.columns.str.startswith(each)].columns.to_list()
    return categorical_feats + numeric_selection
cat_selection = ['body_type', 'drinks', 'drugs', 'education', 'job', 'orientation',
'smokes', 'status', 'diet_adherence', 'diet_type', 'last_online',
'offspring_want', 'offspring_attitude', 'religion_type', 'religion_attitude',
'sign_type', 'sign_attitude', 'dog_preference', 'cat_preference', 'has_dogs',
'has_cats', 'ethnicity', 'speaks']
numeric_selection = ['age', 'height']
feature_selection = feature_selection_to_list(df, cat_selection, numeric_selection)
predictors = df[feature_selection]
predictor_legend = ['Male Predictor', 'Female Predictor']
labels = df.sex
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import plotly.express as px
import numpy as np
import time
Some explanation is necessary before you dive in below. GridSearchCV repeatedly fits a model using every combination of a specified grid of parameter values. For classifiers, GridSearchCV also uses StratifiedKFold cross-validation to report a model score. It is your decision which evaluation metric to use as the score, but by default a classifier's accuracy is used. If you do not pass an argument to the cv parameter of GridSearchCV, it uses k=5, i.e. 5-fold cross-validation, by default. However, if you want to control the folds, or choose a different validator such as RepeatedStratifiedKFold, you pass a configured cross-validator object as the cv argument. By not passing an argument below, I have used the default cv.
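As a minimal, self-contained sketch of passing an explicit validator (using toy data from `make_classification` rather than our profile features), a configured `RepeatedStratifiedKFold` can be handed to `cv` like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Toy data standing in for our predictors/labels
X, y = make_classification(n_samples=200, random_state=0)

# Explicit validator: 5 folds, repeated twice with reshuffling
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

search = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.1, 1, 10]},
                      cv=cv)  # omit cv to get the default 5-fold StratifiedKFold
search.fit(X, y)
print(search.best_params_)
```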
It's a good idea to revisit our project objective. The original goal was to "use machine learning to predict gender". We don't want the model to discriminate between genders: as much as possible, it should perform equally well for either class. The metric we will use as the model score for GridSearchCV is accuracy, its default.
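If we instead wanted a score that weights both classes equally regardless of their frequencies, scikit-learn's built-in `'balanced_accuracy'` scorer (the mean of per-class recall) could be passed to GridSearchCV. A toy sketch, not using our actual data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Deliberately imbalanced toy data (roughly 70/30 between classes)
X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=0)

# 'balanced_accuracy' averages recall over classes, so each class counts equally
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={'C': [0.1, 1, 10]},
                      scoring='balanced_accuracy')
search.fit(X, y)
print(round(search.best_score_, 3))
```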
#Log Tuning Time
start = time.time()
C represents the inverse of regularization strength.
if execute_lr_search:
    model = GridSearchCV(estimator=LogisticRegression(),
                         param_grid={'C': list(range(1, 11))})
    pipe_lr = Pipeline([('scaler', StandardScaler()), ('LogisticRegression', model)])
    pipe_lr.fit(predictors, labels)
if execute_lr_search:
    print("The best score is: {}".format(pipe_lr['LogisticRegression'].best_score_))
    print("The best parameters are: {}".format(pipe_lr['LogisticRegression'].best_params_))
    fig = px.scatter(pd.DataFrame(pipe_lr['LogisticRegression'].cv_results_),
                     x='param_C', y='mean_test_score', title='Different values of C')
    fig.data[0].update(mode='markers+lines')
    fig.show()
The best score is: 0.8885489343130564
The best parameters are: {'C': 1}
Max Tree Depth
if execute_dt_search:
    model = GridSearchCV(estimator=DecisionTreeClassifier(),
                         param_grid={'max_depth': list(range(3, 100))})
    pipe_dt = Pipeline([('scaler', StandardScaler()), ('DecisionTreeClassifier', model)])
    pipe_dt.fit(predictors, labels)
if execute_dt_search:
    print(pipe_dt['DecisionTreeClassifier'].best_score_)
    print(pipe_dt['DecisionTreeClassifier'].best_params_)
    fig = px.scatter(pd.DataFrame(pipe_dt['DecisionTreeClassifier'].cv_results_),
                     x='param_max_depth', y='mean_test_score', title='Different tree depths')
    fig.data[0].update(mode='markers+lines')
    fig.show()
    fig.write_html("Decision_tree_depths.html")
0.8651586036827619
{'max_depth': 7}
Max Tree Depth and Number of Trees in the Forest
if execute_rf_search:
    model = GridSearchCV(estimator=RandomForestClassifier(),
                         param_grid={'max_depth': list(range(6, 40)), 'n_estimators': [150, 180, 300]})
    pipe_rf = Pipeline([('scaler', StandardScaler()), ('rfc', model)])
    pipe_rf.fit(predictors, labels)
if execute_rf_search:
    print(pipe_rf['rfc'].best_score_)
    print(pipe_rf['rfc'].best_params_)
    fig = px.scatter_3d(pd.DataFrame(pipe_rf['rfc'].cv_results_),
                        x='param_max_depth',
                        y='param_n_estimators',
                        z='mean_test_score',
                        color='mean_test_score',
                        color_continuous_scale='Viridis',
                        title='Different Random Forest Parameters')
    fig.show()
    fig.write_html("Random_forest_grid_crossval.html")
0.883750457350503
{'max_depth': 39, 'n_estimators': 300}
The 3D Scatter plot can be rotated, panned, and zoomed.
end = time.time()
duration = round(end-start, 2)
print("Time to tune 3 models was: " + str(duration) + " secs")
Time to tune 3 models was: 24837.2 secs
Note that the Random Forest still gains performance at higher max depths, unlike the Decision Tree, whose score drops off after peaking at depth = 7.
Let's set aside our test and train sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(predictors, labels, test_size=0.2, random_state=1)
start = time.time()
model = LogisticRegression()
pipe_lr = Pipeline([('scaler', StandardScaler()), ('LogisticRegression', model)])
model = DecisionTreeClassifier(max_depth = 7)
pipe_dt = Pipeline([('scaler', StandardScaler()), ('DecisionTreeClassifier', model)])
model = RandomForestClassifier(max_depth = 34, n_estimators = 300)
pipe_rf = Pipeline([('scaler', StandardScaler()), ('RandomForestClassifier', model)])
trained_pipe_lr = pipe_lr.fit(X_train, y_train)
trained_pipe_dt = pipe_dt.fit(X_train, y_train)
trained_pipe_rf = pipe_rf.fit(X_train, y_train)
end = time.time()
duration = round(end-start, 2)
print("Time to train 3 models was: " + str(duration) + " secs")
Time to train 3 models was: 73.7 secs
lr_predictions = trained_pipe_lr.predict(X_test)
dt_predictions = trained_pipe_dt.predict(X_test)
rf_predictions = trained_pipe_rf.predict(X_test)
model_accuracy_dict = {}
model_list = [trained_pipe_lr, trained_pipe_dt, trained_pipe_rf]
model_names = ['LogisticRegression', 'DecisionTreeClassifier', 'RandomForestClassifier']
for model, name in zip(model_list, model_names):
    test_score = model.score(X_test, y_test)
    model_accuracy_dict[name] = test_score
accuracy_df = pd.DataFrame(model_accuracy_dict, index = ["Accuracy"])
fig = px.bar(accuracy_df.transpose(), y=model_names, x = "Accuracy", orientation = 'h')
fig.update_xaxes(range=(0.84, 0.9))
fig.update_yaxes(categoryorder = "total ascending")
fig.update_traces(texttemplate='%{value:.3f}', textposition = 'outside')
fig.update_layout(title="Accuracy for Different Models", yaxis_title="Model",)
fig.show()
from sklearn.metrics import confusion_matrix
confusion_matrices = []
for name, predictions in zip(model_names, [lr_predictions, dt_predictions, rf_predictions]):
    confusion_df = pd.DataFrame(confusion_matrix(y_test, predictions, normalize='true'),
                                columns=['pred_Female', 'pred_Male'],
                                index=['actual_Female', 'actual_Male'])
    confusion_df = confusion_df.transpose()
    confusion_df['Model'] = name
    confusion_df['actual_Female'] = confusion_df['actual_Female'] * 100
    confusion_df['actual_Male'] = confusion_df['actual_Male'] * 100
    confusion_matrices.append(confusion_df)
confusion_df = pd.concat([matrix for matrix in confusion_matrices])
confusion_df
| | actual_Female | actual_Male | Model |
|---|---|---|---|
| pred_Female | 85.586149 | 8.369368 | LogisticRegression |
| pred_Male | 14.413851 | 91.630632 | LogisticRegression |
| pred_Female | 80.767626 | 9.024969 | DecisionTreeClassifier |
| pred_Male | 19.232374 | 90.975031 | DecisionTreeClassifier |
| pred_Female | 83.166458 | 7.783512 | RandomForestClassifier |
| pred_Male | 16.833542 | 92.216488 | RandomForestClassifier |
import plotly.graph_objects as go
fig = go.Figure(data = [
go.Bar(
x=confusion_df.actual_Male.loc['pred_Male'],
y=confusion_df.Model.loc['pred_Male'],
name = 'Correctly Classified Males',
orientation='h',
offsetgroup = 0,
texttemplate = '%{value:.1f}%',
textposition = 'inside',
hoverinfo = 'none'
),
go.Bar(
x=confusion_df.actual_Male.loc['pred_Female'],
y=confusion_df.Model.loc['pred_Female'],
name = 'Misclassified Males',
orientation='h',
offsetgroup = 0,
base = [each for each in confusion_df.actual_Male.loc['pred_Male']],
texttemplate = '%{value:.1f}%',
textposition = 'inside',
hoverinfo = 'none'
),
go.Bar(
x=confusion_df.actual_Female.loc['pred_Female'],
y=confusion_df.Model.loc['pred_Female'],
name = 'Correctly Classified Females',
orientation='h',
offsetgroup = 1,
texttemplate = '%{value:.1f}%',
textposition = 'inside',
hoverinfo = 'none'
),
go.Bar(
x=confusion_df.actual_Female.loc['pred_Male'],
y=confusion_df.Model.loc['pred_Male'],
name = 'Misclassified Females',
orientation='h',
offsetgroup = 1,
base = [each for each in confusion_df.actual_Female.loc['pred_Female']],
texttemplate = '%{value:.1f}%',
textposition = 'inside',
hoverinfo = 'none'
),
],
layout=go.Layout(
title="Male & Female Class Predictions per Model",
yaxis_title="Model",
xaxis_title = "Percentage"
)
)
fig.update_xaxes(range=(0, 100))
fig.show()
The plot may be zoomed in on the rightmost region, in order to further highlight differences.
In all cases the models classify males better than females. In the best case, the logistic regression model, male recall (91.6%) exceeds female recall (85.6%) by roughly six percentage points. This is a consequence of either training on a male-skewed dataset or not having enough reliable features that allow the model to confidently classify women (or both).
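One way to quantify this gap directly is per-class recall. A minimal sketch, with invented labels standing in for `y_test` and `lr_predictions`:

```python
from sklearn.metrics import classification_report, recall_score

# Toy stand-ins for y_test and a model's predictions
y_true = ['f', 'f', 'f', 'm', 'm', 'm', 'm', 'm']
y_pred = ['f', 'f', 'm', 'm', 'm', 'm', 'm', 'f']

# Per-class recall: what fraction of each class is correctly identified
per_class = recall_score(y_true, y_pred, average=None, labels=['f', 'm'])
print(dict(zip(['f', 'm'], per_class)))

# classification_report gives precision, recall, and f1 per class in one call
print(classification_report(y_true, y_pred))
```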
This is not without consequence in the real world. I recommend the Netflix documentary "Coded Bias". In the documentary, during a congressional hearing, Alexandria Ocasio-Cortez questions Joy Buolamwini. Here is a selected excerpt:
AOC: "What demographic is it [AI models] mostly effective on?"
JB: "White Men"
AOC: "And who are the primary engineers and designers of these algorithms?"
JB: "Definitely white men"
Despite the exchange above, it's possible to make a biased AI model without being a "white man": you could simply be an ML engineer who failed to properly evaluate your model's performance at identifying every class label. This issue has already entered the mainstream social and political spheres.
predictor_legend = ['Male Predictor', 'Female Predictor']
fig = px.bar(
y=predictors.columns,
x=abs(trained_pipe_lr['LogisticRegression'].coef_[0]),
color=[predictor_legend[0] if c > 0 else predictor_legend[1] for c in trained_pipe_lr['LogisticRegression'].coef_[0]],
color_discrete_sequence=['red', 'blue'],
labels=dict(x='Predictor', y='Weight/Importance'),
title='Top 20 Predictors of the Logistic Regression Model',
)
fig.update_yaxes(categoryorder = "total ascending", range=(len(predictors.columns) - 20.6, len(predictors.columns)))
fig.show()
fig = px.bar(
y=predictors.columns,
x=trained_pipe_dt['DecisionTreeClassifier'].feature_importances_,
labels=dict(x='Predictor', y='Weight/Importance'),
title='Top 20 Predictors of the Decision Tree Model',
)
fig.update_yaxes(categoryorder = "total ascending", range=(len(predictors.columns) - 20.6, len(predictors.columns)))
fig.show()
fig = px.bar(
y=predictors.columns,
x=trained_pipe_rf['RandomForestClassifier'].feature_importances_,
labels=dict(x='Predictor', y='Weight/Importance'),
title='Top 20 Predictors of the Random Forest Model',
)
fig.update_yaxes(categoryorder = "total ascending", range=(len(predictors.columns) - 20.6, len(predictors.columns)))
fig.show()
For all models, height and body_type_curvy are the top two predictors, probably because men are taller than women on average, and because women are likely to describe themselves as curvy whereas men are not. It is interesting that beyond the top two, the random forest orders its feature importances differently from the other models.
With Logistic Regression we can conveniently see which features were more useful for predicting each class (male or female) because its weight coefficients are signed, unlike the feature importances of the Random Forest and Decision Tree. We can see that the model has more confidence in its male predictors than in its female predictors.
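To illustrate how the signed coefficients split into per-class predictors, here is a toy sketch; the data and the feature names are invented stand-ins, not our real columns:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented data where feature 0 pushes toward class 1 and feature 2 toward class 0
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
feature_names = ['height', 'age', 'body_type_curvy', 'drinks_often']  # hypothetical

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(model.coef_[0], index=feature_names)

# Positive weights push toward class 1, negative weights toward class 0
print("Top class-1 predictors:\n", coefs.sort_values(ascending=False).head(2))
print("Top class-0 predictors:\n", coefs.sort_values().head(2))
```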
Although not shown here, because the age distribution is skewed toward the young, the model is most likely more effective at classifying young people than old people. One way to alleviate this is to apply a power transform to the age feature before training, making its distribution more nearly normal (the same fitted transform must also be applied to the test set before predicting). We would then evaluate the model's performance on young vs. old subsets of our data.
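A minimal sketch of that transform, using scikit-learn's PowerTransformer on an invented, young-skewed "age" sample, fitted on training data only and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Invented, young-skewed age distribution (exponential, shifted to adult ages)
ages = np.random.default_rng(1).exponential(scale=8, size=500) + 18
ages = ages.reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')   # learns lambda on the training data
ages_train = pt.fit_transform(ages[:400])     # fit on the train split only
ages_test = pt.transform(ages[400:])          # reuse the same fitted lambda on test

# By default PowerTransformer also standardizes, so train mean ~ 0 and std ~ 1
print(round(float(ages_train.mean()), 3), round(float(ages_train.std()), 3))
```

Putting the transformer inside the existing `Pipeline` (before the scaler step) would apply this train/test discipline automatically.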
Model evaluation can be taken further still. The scikit-learn documentation has an example of applying frequentist and Bayesian statistical approaches to compare models more definitively.
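As a simplified frequentist sketch (not the full scikit-learn example), one can score two models on identical folds and run a paired t-test on the per-fold scores. Note, as that example discusses, that overlapping training sets violate the t-test's independence assumption, so this is only indicative:

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for our predictors/labels
X, y = make_classification(n_samples=300, random_state=0)

# The same fold object guarantees both models see identical splits
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on the per-fold score differences
t_stat, p_value = stats.ttest_rel(scores_lr, scores_dt)
print(round(t_stat, 2), round(p_value, 4))
```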
Now that you know how to evaluate a model, the challenge is to learn how to improve model performance: not just overall, but fairly across all classes. You will rarely ever have a dataset that is perfectly balanced.
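One common handle on class imbalance, sketched here on toy data, is `class_weight='balanced'`, which reweights the loss so that the minority class counts as much as the majority:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 80/20 between classes
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# Compare minority-class recall; reweighting typically improves it,
# usually at some cost to overall accuracy
r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(r_plain, r_weighted)
```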